arXiv:1507.00333v3 [cs.NA] 6 May 2016 


Notes on Low-rank Matrix Factorization 


Yuan Lu, Jie Yang* 

{joyce.yuan.lu,yangjiera}@gmail.com.

* Faculty of EEMCS, 

Delft University of Technology, 
Mekelweg 4, 2628 CD Delft, the Netherlands.


Dedicated to Xiao Baobao and Tu Daye. 



1 Introduction 


Low-rank matrix factorization (MF) is an important technique in data science. The key idea of MF is that there exist latent structures in the data, by uncovering which we can obtain a compressed representation of the data. By factorizing an original matrix into low-rank matrices, MF provides a unified method for dimension reduction, clustering, and matrix completion.

MF has several nice properties: 1) it uncovers latent structures in the data, while addressing the data sparseness problem [11]; 2) it has an elegant probabilistic interpretation [15]; 3) it can be easily extended with domain-specific prior knowledge (e.g., homophily in linked data [19]), and is thus suitable for various real-world problems; 4) many optimization methods such as (stochastic) gradient-based methods can be applied to find a good solution.

In this article we review several important variants of MF, including: 

• Basic MF, 

• Non-negative MF, 

• Orthogonal non-negative MF. 

As can be seen from their names, non-negative MF and orthogonal non-negative MF are variants of basic MF with non-negativity and/or orthogonality constraints. Such constraints are useful in specific scenarios. In the first part of this article, we introduce, for each of these models, the application scenarios, the distinctive properties, and the optimizing method. Note that for the optimizing method, we mainly use the alternating algorithm, similar to [?]. We will derive the updating rules, and prove the correctness and convergence. For reference, matrix operations and optimization are covered in [2] and [1], respectively.

By properly adapting MF, we can go beyond the problem of clustering and matrix completion. In the second part of this article, we will extend MF to sparse matrix completion, enhance matrix completion using various regularization methods, and make use of MF for (semi-)supervised learning by introducing latent space reinforcement and transformation. We will see that MF is not only a useful model but also a flexible framework that is applicable to various prediction problems.


2 Theory 

This section introduces the theory of low-rank matrix factorization. As introduced before, we will go through the following three MF variants: basic MF, non-negative MF, and orthogonal non-negative MF.

2.1 Basic MF

We start with the basic MF model, formulated as 


$$\min_{U,V} \; \|X - UV^T\| + \mathcal{L}(U, V), \tag{1}$$

where $X \in \mathbb{R}^{m \times n}$ is the data matrix to be approximated, and $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$ are two low-dimensional matrices ($k \ll \min(m, n)$). $\mathcal{L}(U, V)$ is a regularization part to avoid overfitting. Regularization is usually necessary in prediction for bias-variance trade-off [3].

2.1.1 Gradient Descent Optimization 

We instantiate Eq. (1) as follows:

$$\min_{U,V} \; \mathcal{O} = \|X - UV^T\|_F^2 + \alpha\|U\|_F^2 + \beta\|V\|_F^2. \tag{2}$$

The reason for using the Frobenius norm is that it has a Gaussian noise interpretation, and that the objective function can be easily transformed to a matrix trace version:

$$\min_{U,V} \; \mathcal{O} = \mathrm{Tr}(X^T X + V U^T U V^T - 2 X^T U V^T) + \alpha\,\mathrm{Tr}(U^T U) + \beta\,\mathrm{Tr}(V^T V). \tag{3}$$

Here the matrix calculation rule $\|A\|_F = \sqrt{\mathrm{Tr}(A^T A)}$ is used in the transformation. Note that trace has many good properties such as $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$ and $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, which will be used in the following derivations.

According to the trace derivative $\partial\,\mathrm{Tr}(AB)/\partial A = B^T$ and the following rules:

$$\frac{\partial\,\mathrm{Tr}(A^T A B)}{\partial A} = A(B^T + B), \qquad \frac{\partial\,\mathrm{Tr}(A A^T B)}{\partial A} = (B^T + B)A \tag{4}$$

(see more in [2]), we have the following derivatives for U and V,


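$$\begin{aligned}
\frac{\partial \mathcal{O}}{\partial U} &= \frac{\partial\,[\mathrm{Tr}(V U^T U V^T - 2 X^T U V^T) + \alpha\,\mathrm{Tr}(U^T U)]}{\partial U} \\
&= \frac{\partial\,[\mathrm{Tr}(U^T U V^T V - 2 U V^T X^T) + \alpha\,\mathrm{Tr}(U^T U)]}{\partial U} \\
&= 2(U V^T V - X V + \alpha U), \\
\frac{\partial \mathcal{O}}{\partial V} &= \frac{\partial\,[\mathrm{Tr}(V U^T U V^T - 2 X^T U V^T) + \beta\,\mathrm{Tr}(V^T V)]}{\partial V} \\
&= \frac{\partial\,[\mathrm{Tr}(V^T V U^T U - 2 V^T X^T U) + \beta\,\mathrm{Tr}(V^T V)]}{\partial V} \\
&= 2(V U^T U - X^T U + \beta V).
\end{aligned} \tag{5}$$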


Using these two derivatives, we can alternately update U and V in each iteration of the gradient descent algorithm.
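As an illustration, the alternating gradient step can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names, the fixed learning rate `lr`, the iteration count and the random initialization are all assumptions made for the example; only the two gradient expressions come from the derivation above.

```python
import numpy as np


def mf_objective(X, U, V, alpha, beta):
    """Regularized squared Frobenius error of the basic MF model."""
    return (np.linalg.norm(X - U @ V.T, "fro") ** 2
            + alpha * np.linalg.norm(U, "fro") ** 2
            + beta * np.linalg.norm(V, "fro") ** 2)


def mf_gradient_descent(X, k, alpha=0.1, beta=0.1, lr=1e-3, n_iter=500, seed=0):
    """Alternately update U and V with plain gradient steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    history = []
    for _ in range(n_iter):
        # Gradient with respect to U, with V held fixed.
        U = U - lr * 2 * (U @ V.T @ V - X @ V + alpha * U)
        # Gradient with respect to V, using the freshly updated U.
        V = V - lr * 2 * (V @ U.T @ U - X.T @ U + beta * V)
        history.append(mf_objective(X, U, V, alpha, beta))
    return U, V, history
```

The learning rate has to be small relative to the scale of X; a value that works for one data matrix may diverge on another.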

Note that the derivation can also be performed element-wise for each entry in the matrices U, V; this is, in fact, the original definition of matrix calculus. Such element-wise derivation is especially useful in stochastic optimization. We will touch on this in a brief discussion of different algorithm schemes next.

2.1.2 Algorithm Schemes in CF and Others 

For collaborative filtering, we usually take one subset of the rated entries in X as the training set, and the remaining rated entries as the validation set. A detailed algorithm can be found in [?]. An important implementation strategy is that, for each rated entry in the training set, we update an entire row of U and an entire column of $V^T$, as the whole row or column is involved in approximating the rated entry. The same updating mechanism can be applied in the stochastic algorithm.
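The per-entry strategy described above can be sketched as follows. This is only an assumed form of the stochastic scheme: the helper name `sgd_epoch`, the boolean `observed` mask, the per-entry regularized loss and the step size are illustrative choices, not taken from the cited algorithm. Each step touches one row of U and one row of V (i.e. one column of $V^T$).

```python
import numpy as np


def sgd_epoch(X, observed, U, V, alpha=0.05, beta=0.05, lr=0.01, rng=None):
    """One pass over the rated entries, updating U[i] and V[j] in place."""
    rng = np.random.default_rng(0) if rng is None else rng
    entries = rng.permutation(np.argwhere(observed))  # visit rated entries in random order
    for i, j in entries:
        err = X[i, j] - U[i] @ V[j]          # residual of this single entry
        u_old = U[i].copy()                  # keep the old row for the V step
        U[i] += lr * 2 * (err * V[j] - alpha * U[i])
        V[j] += lr * 2 * (err * u_old - beta * V[j])
    return U, V
```

Held-out rated entries can be scored with the same residual to monitor validation error between epochs.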

Meanwhile, as with the stochastic algorithm, this type of updating does not fully utilize the data matrix in each updating iteration. The reason is that not only is an entire row of U (and a column of $V^T$) involved in a single entry of the data matrix X, but a row of U (and a column of $V^T$) also influences an entire row (column) of X. Therefore, for faster convergence, we recommend updating the matrices U and V by fully using the data matrix X.

As the objective function is non-convex due to the coupling between U and V, we can choose to alternately update U and V in each iteration as in [?]. The detailed algorithm is similar to the one in [?]. Within either of these matrices, updating should be performed simultaneously, as in all gradient-based methods. Note that we still need to choose a small learning rate to ensure that the objective function is monotonically decreasing. Interestingly, the alternating
optimization scheme is even more suitable for non-negative MF [T31 HH 0 0] , 
as we will see in the following subsections. 

2.2 Non-negative MF 

Non-negative MF [13] seeks to approximate the data matrix X with low-dimensional matrices U, V whose entries are all non-negative, i.e., $U, V \geq 0$. The new problem becomes:

$$\begin{aligned} \min_{U,V} \; & \mathcal{O} = \|X - UV^T\|_F^2 + \alpha\|U\|_F^2 + \beta\|V\|_F^2 \\ \text{s.t.} \; & U \geq 0, \; V \geq 0. \end{aligned} \tag{6}$$


The non-negativity constraint originates from the parts-of-whole interpretation [18]. Many real-world data are non-negative, such as link strength, favorite strength, etc. Non-negative MF may uncover the important parts, which sometimes cannot be achieved by non-constrained MF [15].

Apart from the advantage of uncovering parts, non-negative MF has its own computational advantage: there is a relatively fixed method to find a learning rate larger than that of common gradient-based methods. To illustrate this, we will first derive the updating rule for Eq. (6) as an example, then show the general approach for proving the convergence of updating rules derived from the relatively fixed method.

2.2.1 Updating Rule Derivation 

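The basic idea is to use the KKT complementary slackness conditions to enforce the non-negativity constraint. Based on this, we can directly obtain updating rules. The Lagrangian function of Eq. (6) is

$$L = \|X - UV^T\|_F^2 + \alpha\|U\|_F^2 + \beta\|V\|_F^2 - \mathrm{Tr}(\Lambda_1 U^T) - \mathrm{Tr}(\Lambda_2 V^T). \tag{7}$$

We have the following KKT conditions:

$$\Lambda_1 \circ U = 0, \qquad \Lambda_2 \circ V = 0, \tag{8}$$

where $\circ$ denotes the Hadamard product. We then have

$$\begin{aligned}
\frac{\partial L}{\partial U} &= \frac{\partial\,[\mathrm{Tr}(VU^TUV^T - 2X^TUV^T) + \alpha\,\mathrm{Tr}(U^TU) - \mathrm{Tr}(\Lambda_1 U^T)]}{\partial U} \\
&= 2(UV^TV - XV + \alpha U) - \Lambda_1, \\
\frac{\partial L}{\partial V} &= \frac{\partial\,[\mathrm{Tr}(VU^TUV^T - 2X^TUV^T) + \beta\,\mathrm{Tr}(V^TV) - \mathrm{Tr}(\Lambda_2 V^T)]}{\partial V} \\
&= 2(VU^TU - X^TU + \beta V) - \Lambda_2.
\end{aligned} \tag{9}$$

Letting $\partial L/\partial U = 0$ and $\partial L/\partial V = 0$ as another KKT condition, we have

$$\Lambda_1 = 2(UV^TV - XV + \alpha U), \qquad \Lambda_2 = 2(VU^TU - X^TU + \beta V). \tag{10}$$

Now, combining Eq. (8) and Eq. (10), we have

$$(UV^TV - XV + \alpha U) \circ U = 0, \qquad (VU^TU - X^TU + \beta V) \circ V = 0, \tag{11}$$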


from which we have the final updating rules,

$$U(i,j) \leftarrow U(i,j)\sqrt{\frac{(XV)(i,j)}{(UV^TV + \alpha U)(i,j)}}, \qquad V(i,j) \leftarrow V(i,j)\sqrt{\frac{(X^TU)(i,j)}{(VU^TU + \beta V)(i,j)}}. \tag{12}$$

The detailed algorithm using these rules is similar to the one in [?]. We can see that, instead of manually setting small learning rates, Eq. (12) directly offers updating rules that can usually lead to faster convergence.

The correctness of these updating rules is straightforward to verify. Taking U as an example, at a fixed point of Eq. (12) we have, entry-wise, either U = 0 or $UV^TV - XV + \alpha U = 0$, which combined together is exactly Eq. (11). The convergence, however, is somewhat more difficult to prove. We leave this to the next subsubsection.
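A minimal NumPy sketch of the multiplicative updates is given below, assuming the square-root form that follows from Eq. (22) and a non-negative data matrix X. The function name, the strictly positive random initialization and the small constant `eps` guarding against division by zero are assumptions of this example, not part of the derivation.

```python
import numpy as np


def nmf_multiplicative(X, k, alpha=0.1, beta=0.1, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for regularized non-negative MF (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + 0.1   # strictly positive start keeps all iterates non-negative
    V = rng.random((n, k)) + 0.1
    history = []
    for _ in range(n_iter):
        # Negative gradient terms go to the numerator, positive ones to the denominator.
        U = U * np.sqrt((X @ V) / (U @ V.T @ V + alpha * U + eps))
        V = V * np.sqrt((X.T @ U) / (V @ U.T @ U + beta * V + eps))
        obj = (np.linalg.norm(X - U @ V.T, "fro") ** 2
               + alpha * np.sum(U ** 2) + beta * np.sum(V ** 2))
        history.append(obj)
    return U, V, history
```

No learning rate appears: the step size is implied by the ratio of the two groups of gradient terms.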

2.2.2 Proof of Convergence 

We prove the convergence of the updating rules in Eq. (12) with the standard auxiliary function approach, which is proposed in [?] and extended in [?]. Our proof is mainly based on [?], although the objective function Eq. (6) is slightly different.

An auxiliary function $G(U, U^t)$ of a function $L(U)$ is a function that satisfies

$$G(U, U) = L(U), \qquad G(U, U^t) \geq L(U). \tag{13}$$

Then, if we take $U^{t+1}$ such that

$$U^{t+1} = \arg\min_{U} G(U, U^t), \tag{14}$$

we have

$$L(U^{t+1}) \leq G(U^{t+1}, U^t) \leq G(U^t, U^t) = L(U^t). \tag{15}$$

This proves that $L(U)$ is monotonically decreasing.

Turning back to our problem, we need to take two steps using the auxiliary function to prove the convergence of the updating rules: 1) find an appropriate auxiliary function, and 2) find the global minimum of the auxiliary function. As a remark, the auxiliary function approach is in principle similar to the Expectation-Maximization approach that is widely used in statistical inference. Now let us complete the proof by taking the above two steps.

Step 1 - Finding an appropriate auxiliary function needs to take advantage of two inequalities:

$$z \geq 1 + \log z, \quad \forall z > 0, \tag{16}$$




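$$\sum_{i=1}^{m}\sum_{j=1}^{k} \frac{(A S' B)(i,j)\, S(i,j)^2}{S'(i,j)} \geq \mathrm{Tr}(S^T A S B), \quad \forall A \in \mathbb{R}_{+}^{m \times m},\; B \in \mathbb{R}_{+}^{k \times k},\; S' \in \mathbb{R}_{+}^{m \times k},\; S \in \mathbb{R}_{+}^{m \times k}, \tag{17}$$

with A and B symmetric. The proof of Eq. (17) can be found in [5] (Proposition 6).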

After removing irrelevant terms, the objective function Eq. (6) in terms of U can be written as

$$\mathrm{Tr}(VU^TUV^T - 2X^TUV^T) + \alpha\,\mathrm{Tr}(U^TU) = \mathrm{Tr}(U^TUV^TV - 2U^TXV) + \alpha\,\mathrm{Tr}(U^TU). \tag{18}$$

We now propose an auxiliary function


$$G(U, U^t) = -2\sum_{i,j} (XV)(i,j)\, U^t(i,j)\left(1 + \log\frac{U(i,j)}{U^t(i,j)}\right) + \sum_{i,j} \frac{(U^t V^T V)(i,j)\, U(i,j)^2}{U^t(i,j)} + \alpha\sum_{i,j} U(i,j)^2. \tag{19}$$

Combining the two inequalities Eq. (16) and Eq. (17), it is straightforward to see that Eq. (19) is a legal auxiliary function for Eq. (18), i.e., the two conditions in Eq. (13) are satisfied. Now we proceed to find $U^{t+1}$ that satisfies condition Eq. (14).

Step 2 - Finding $U^{t+1}$ can be achieved by obtaining the global minimum of Eq. (19). First, we have


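$$\frac{\partial G(U, U^t)}{\partial U(i,j)} = -2(XV)(i,j)\frac{U^t(i,j)}{U(i,j)} + 2\frac{(U^tV^TV)(i,j)\,U(i,j)}{U^t(i,j)} + 2\alpha U(i,j). \tag{20}$$

Letting $\partial G(U, U^t)/\partial U(i,j) = 0$, we have

$$(XV)(i,j)\frac{U^t(i,j)}{U^{t+1}(i,j)} = \left(\frac{(U^tV^TV)(i,j)}{U^t(i,j)} + \alpha\right) U^{t+1}(i,j), \tag{21}$$

from which we directly have

$$U^{t+1}(i,j) = U^t(i,j)\sqrt{\frac{(XV)(i,j)}{(U^tV^TV + \alpha U^t)(i,j)}}, \tag{22}$$

which is exactly the updating rule for U in Eq. (12). A similar result can be obtained for V.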
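The two defining properties of the auxiliary function, and the descent guaranteed by Eq. (22), can be checked numerically. The sketch below assumes the form of the auxiliary function implied by Eq. (16), Eq. (17) and the derivative in Eq. (20); the function names and the random test matrices are illustrative only.

```python
import numpy as np


def L_of_U(U, X, V, alpha):
    """Objective in terms of U only, as in Eq. (18)."""
    return (np.trace(U.T @ U @ V.T @ V) - 2 * np.trace(U.T @ X @ V)
            + alpha * np.trace(U.T @ U))


def G_aux(U, Ut, X, V, alpha):
    """Auxiliary function G(U, U^t); requires strictly positive U and Ut."""
    linear = -2 * np.sum((X @ V) * Ut * (1 + np.log(U / Ut)))
    quadratic = np.sum((Ut @ V.T @ V) * U ** 2 / Ut)
    return linear + quadratic + alpha * np.sum(U ** 2)


rng = np.random.default_rng(4)
X, V, alpha = rng.random((6, 5)), rng.random((5, 3)), 0.2
Ut = rng.random((6, 3)) + 0.1
U = rng.random((6, 3)) + 0.1

# Minimizer of G(., Ut), i.e. one multiplicative step for U.
U_next = Ut * np.sqrt((X @ V) / (Ut @ V.T @ V + alpha * Ut))

print(np.isclose(G_aux(Ut, Ut, X, V, alpha), L_of_U(Ut, X, V, alpha)))  # G(U, U) = L(U)
print(G_aux(U, Ut, X, V, alpha) >= L_of_U(U, X, V, alpha))             # G(U, U^t) >= L(U)
print(L_of_U(U_next, X, V, alpha) <= L_of_U(Ut, X, V, alpha))          # descent
```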


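General observation If we go over the entire derivation process, by comparing Eq. (22) and Eq. (11) we can observe that the only thing that matters for the final updating rules is the signs of the terms in Eq. (11): terms with a negative sign end up in the numerator of the updating rule, and terms with a positive sign end up in the denominator.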

2.3 Orthogonal Non-negative MF 

Orthogonality is another important constraint for MF. First of all, we formulate the problem as

$$\begin{aligned} \min_{U,V} \; & \mathcal{O} = \|X - UV^T\|_F^2 \\ \text{s.t.} \; & U, V \geq 0, \; U^TU = I, \; V^TV = I. \end{aligned} \tag{23}$$


Note that here we do not add regularization due to the orthogonality constraint.

It is proved in [?] ([?] gives a more mature proof) that this problem is equivalent to K-means clustering: V' is an indicator matrix with $V'(i,j) \neq 0$ if $x_i$ belongs to the $j$-th ($1 \leq j \leq k$) cluster. Here $V = V'(V'^T V')^{-1/2}$, i.e., V is a normalized version of V': each row of V' is a constant scaling of the corresponding row of V, and $\|V(:,j)\|_2 = 1$.
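To make the relation between the indicator matrix and its normalized version concrete, the following sketch builds both from a vector of cluster labels. It assumes the normalization is a column-wise scaling by the inverse square root of the cluster sizes, which yields unit-norm, mutually orthogonal columns; the function name and the label encoding are hypothetical, and every cluster is assumed to be non-empty.

```python
import numpy as np


def normalized_indicator(labels, k):
    """Return the 0/1 indicator matrix V' and its column-normalized version V."""
    n = len(labels)
    V_prime = np.zeros((n, k))
    V_prime[np.arange(n), labels] = 1.0      # V'(i, j) = 1 if point i is in cluster j
    counts = V_prime.sum(axis=0)             # V'^T V' = diag(counts)
    V = V_prime / np.sqrt(counts)            # scale each column to unit 2-norm
    return V_prime, V


labels = np.array([0, 1, 1, 2, 0, 2, 2])
V_prime, V = normalized_indicator(labels, 3)
print(np.allclose(V.T @ V, np.eye(3)))
```

Because each point belongs to exactly one cluster, the columns have disjoint supports, so orthogonality and non-negativity hold together.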

2.3.1 3-factor MF vs. 2-factor MF 

We call Eq. (23) 1-sided 2-factor orthogonal non-negative MF, as only one factorized matrix needs to be orthogonal, and there are in total two factorized matrices. It is recommended that, to simultaneously cluster rows and columns in X, we use 3-factor bi-orthogonal non-negative MF, i.e., both U and V being orthogonal:

$$\begin{aligned} \min_{U,H,V} \; & \mathcal{O} = \|X - UHV^T\|_F^2 \\ \text{s.t.} \; & U, H, V \geq 0, \; U^TU = I, \; V^TV = I. \end{aligned} \tag{24}$$


It is proved that, compared to 3-factor bi-orthogonal non-negative MF, 2-factor bi-orthogonal non-negative MF is too restrictive, and will lead to poor approximation [5].

3-factor bi-orthogonal non-negative MF is useful in document-word clustering [5], outperforming K-means (i.e., 1-sided 2-factor orthogonal non-negative MF). It has been applied to tasks such as sentiment analysis [10].


2.3.2 Updating Rule Derivation 

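We now derive updating rules for Eq. (24), as we did before for non-negative MF.

The Lagrangian function for Eq. (24) is

$$\begin{aligned} L = {} & \|X - UHV^T\|_F^2 - \mathrm{Tr}(\Lambda_U U^T) - \mathrm{Tr}(\Lambda_H H^T) - \mathrm{Tr}(\Lambda_V V^T) \\ & + \mathrm{Tr}(\Gamma_U(U^TU - I)) + \mathrm{Tr}(\Gamma_V(V^TV - I)). \end{aligned} \tag{25}$$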

We then compute the updating rules for H, U, V sequentially. 

Computation of H 

$$\begin{aligned}
\frac{\partial L}{\partial H} &= \frac{\partial\,[\mathrm{Tr}(VH^TU^TUHV^T - 2XVH^TU^T) - \mathrm{Tr}(\Lambda_H H^T)]}{\partial H} \\
&= 2U^TUHV^TV - 2U^TXV - \Lambda_H.
\end{aligned} \tag{26}$$

We have the following KKT conditions,

$$\frac{\partial L}{\partial H} = 0, \qquad \Lambda_H \circ H = 0. \tag{27}$$

Combining the above three equations, we have

$$(U^TUHV^TV - U^TXV) \circ H = 0. \tag{28}$$

Therefore we have the following updating rule for H,

$$H(i,j) \leftarrow H(i,j)\sqrt{\frac{(U^TXV)(i,j)}{(U^TUHV^TV)(i,j)}}. \tag{29}$$

Note that $U^TU \neq I$ during the optimizing process.

Computation of U,V 

Due to the orthogonality constraint, obtaining the updating rules for U, V requires eliminating both $\Lambda$ and $\Gamma$ in the final updating rules. This will need the following relation:

$$U^T\Lambda_U = 0 \;\Leftarrow\; \Lambda_U \circ U = 0. \tag{30}$$

The latter will automatically be satisfied according to the KKT conditions, as we will see below.

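$$\begin{aligned}
\frac{\partial L}{\partial U} &= \frac{\partial\,[\mathrm{Tr}(VH^TU^TUHV^T - 2XVH^TU^T) - \mathrm{Tr}(\Lambda_U U^T) + \mathrm{Tr}(\Gamma_U(U^TU - I))]}{\partial U} \\
&= 2UHV^TVH^T - 2XVH^T - \Lambda_U + 2U\Gamma_U.
\end{aligned} \tag{31}$$

We have the following KKT conditions,

$$\frac{\partial L}{\partial U} = 0, \qquad \Lambda_U \circ U = 0. \tag{32}$$

Combining the above three equations, we have

$$(UHV^TVH^T - XVH^T + U\Gamma_U) \circ U = 0 \tag{33}$$

and

$$\Gamma_U = U^TXVH^T - HV^TVH^T. \tag{34}$$

Note that here we can use $U^TU = I$ as we only want an expression for $\Gamma_U$. Further note that for $\Lambda$ we have the constraint $\Lambda \geq 0$ (according to the KKT conditions), while for $\Gamma$ we do not have such a constraint. Therefore we need to split $\Gamma$ into two parts,

$$\Gamma_U = \Gamma_U^+ - \Gamma_U^-, \qquad \Gamma_U^+ = (|\Gamma_U| + \Gamma_U)/2, \qquad \Gamma_U^- = (|\Gamma_U| - \Gamma_U)/2. \tag{35}$$

Using this division we rewrite Eq. (33), and we then have

$$(UHV^TVH^T - XVH^T + U\Gamma_U^+ - U\Gamma_U^-) \circ U = 0. \tag{36}$$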


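Therefore the final updating rule for U is

$$U(i,j) \leftarrow U(i,j)\sqrt{\frac{(XVH^T + U\Gamma_U^-)(i,j)}{(UHV^TVH^T + U\Gamma_U^+)(i,j)}}, \tag{37}$$

where $\Gamma_U$ is defined in Eq. (34), and $\Gamma_U^+$ and $\Gamma_U^-$ are defined in Eq. (35).

If we go over the same process again for V, we have the following updating rule,

$$V(i,j) \leftarrow V(i,j)\sqrt{\frac{(X^TUH + V\Gamma_V^-)(i,j)}{(VH^TU^TUH + V\Gamma_V^+)(i,j)}}, \tag{38}$$

where $\Gamma_V^+$, $\Gamma_V^-$ are defined similarly as in Eq. (35) (replace U with V), and $\Gamma_V$ is defined as

$$\Gamma_V = V^TX^TUH - H^TU^TUH. \tag{39}$$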
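One sweep of the three updates can be sketched in NumPy as below. This is a hedged illustration rather than a reference implementation: it assumes the square-root multiplicative form used for non-negative MF, a non-negative X, and a small `eps` in the denominators; the function names and the update order H, U, V are choices made for the example.

```python
import numpy as np


def split_pos_neg(G):
    """Split G into non-negative parts with G = G_pos - G_neg."""
    return (np.abs(G) + G) / 2, (np.abs(G) - G) / 2


def onmf_tri_step(X, U, H, V, eps=1e-12):
    """One sweep of multiplicative updates for X ~ U H V^T (illustrative sketch)."""
    # H update: no orthogonality multiplier is involved.
    H = H * np.sqrt((U.T @ X @ V) / (U.T @ U @ H @ V.T @ V + eps))

    # U update: the multiplier Gamma_U is eliminated via its closed form, then split by sign.
    Gu = U.T @ X @ V @ H.T - H @ V.T @ V @ H.T
    Gu_pos, Gu_neg = split_pos_neg(Gu)
    U = U * np.sqrt((X @ V @ H.T + U @ Gu_neg)
                    / (U @ H @ V.T @ V @ H.T + U @ Gu_pos + eps))

    # V update: same pattern with the roles of U and V exchanged.
    Gv = V.T @ X.T @ U @ H - H.T @ U.T @ U @ H
    Gv_pos, Gv_neg = split_pos_neg(Gv)
    V = V * np.sqrt((X.T @ U @ H + V @ Gv_neg)
                    / (V @ H.T @ U.T @ U @ H + V @ Gv_pos + eps))
    return U, H, V
```

The sign split is what keeps every numerator and denominator non-negative, so the factors never leave the non-negative orthant.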


Choice of 2/3-factor MF How do we choose between 2-factor and 3-factor MF in real-world applications? A general principle is: if we only need to place regularization on one latent matrix, i.e. either U or V, then we can use 2-factor MF; if both U and V are to be regularized, either explicitly or implicitly, 3-factor MF might be a better choice.

3 Adaptations and Applications

MF has been used for a wide range of applications in social computing, including collaborative filtering (CF), link prediction (LP), sentiment analysis, etc. It can serve not only as a single model for matrix completion or clustering, but also as a framework for solving almost all categories of prediction problems.

In this part we will extend MF to highly sparse cases. For the cases in which we have additional data, e.g. link data between users (in CF, or additional links in LP) or description data of users and items, we can incorporate different regularization techniques to enhance the matrix completion performance. Moreover, by properly manipulating latent factors derived from MF, we can adapt MF to (un-/semi-)supervised learning.

3.1 Sparse Matrix Completion 

Here we address the problem of using MF for collaborative filtering, link prediction and clustering. We start with a basic assumption, which makes the previously introduced models unsuitable. This basic assumption is that a high portion of the data is missing, i.e. the data matrix is incomplete. Such an assumption is very common in real-world cases [12].

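The problem is solved by directly modeling the observed data. Eq. (2) is modified as follows:

$$\min_{U,V} \; \mathcal{O} = \|O \circ (X - UV^T)\|_F^2 + \alpha\|U\|_F^2 + \beta\|V\|_F^2, \tag{40}$$

in which O poses constraints on only the observed data entries, i.e. $O(i,j) = 1$ if entry $(i,j)$ is observed, and $O(i,j) = 0$ otherwise.

In this case, the objective function is transformed as follows:

$$\begin{aligned} \min_{U,V} \; \mathcal{O} = {} & \mathrm{Tr}\big((O^T \circ X^T)(O \circ X) + (O^T \circ VU^T)(O \circ UV^T) \\ & - 2(O^T \circ X^T)(O \circ UV^T)\big) + \alpha\,\mathrm{Tr}(U^TU) + \beta\,\mathrm{Tr}(V^TV). \end{aligned} \tag{41}$$

And the gradients become:

$$\begin{aligned}
\frac{\partial \mathcal{O}}{\partial U} &= \frac{\partial\,[\mathrm{Tr}((O^T \circ VU^T)(O \circ UV^T) - 2(O^T \circ X^T)(O \circ UV^T)) + \alpha\,\mathrm{Tr}(U^TU)]}{\partial U} \\
&= \frac{\partial\,[\mathrm{Tr}(U^T(O \circ O \circ UV^T)V - 2(O^T \circ O^T \circ X^T)UV^T) + \alpha\,\mathrm{Tr}(U^TU)]}{\partial U} \\
&= 2((O \circ O \circ UV^T)V - (O \circ O \circ X)V + \alpha U), \\
\frac{\partial \mathcal{O}}{\partial V} &= 2((O^T \circ O^T \circ VU^T)U - (O^T \circ O^T \circ X^T)U + \beta V).
\end{aligned} \tag{42}$$

In the derivation above we use the following rule of the Hadamard product:

$$\mathrm{Tr}((O^T \circ A^T)(O \circ A)) = \mathrm{Tr}(A^T(O \circ O \circ A)). \tag{43}$$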

The updating rules for non-negative MF and orthogonal non-negative MF are straightforward: the methods of obtaining $\Lambda$, $\Gamma$ are exactly the same as what we did in the Theory section. For the updating rules of non-negative MF and orthogonal non-negative MF, the reader can refer to [7] and [5], respectively.
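For the unconstrained masked model, the gradients of the observed-entry objective can be written compactly in NumPy. The sketch below is illustrative: the function names are hypothetical, and the mask `O` is assumed to be a dense array of zeros and ones with the same shape as X.

```python
import numpy as np


def masked_objective(X, O, U, V, alpha, beta):
    """Squared error on observed entries plus Frobenius regularization."""
    return (np.sum((O * (X - U @ V.T)) ** 2)
            + alpha * np.sum(U ** 2) + beta * np.sum(V ** 2))


def masked_gradients(X, O, U, V, alpha, beta):
    """Gradients of the masked objective with respect to U and V."""
    R = O * O * (U @ V.T - X)            # residual restricted to observed entries
    grad_U = 2 * (R @ V + alpha * U)
    grad_V = 2 * (R.T @ U + beta * V)
    return grad_U, grad_V
```

These gradients plug into the same alternating update loop as in the fully observed case; unobserved entries simply contribute zero residual.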

3.1.1 Calculating Memory Occupation 

Note that the updating rules above are again purely matrix-wise, to be consistent with the style of this article. In matrix completion, however, the size of the data matrix is sometimes bigger than the memory size, making the stochastic gradient descent algorithm more suitable than the matrix-wise method.

The question here is how to calculate the size of a matrix to see whether it fits in memory. Here is an easy way to make such a calculation. Assume we have a 10K × 10K matrix, with each entry allocated a 32-bit float (e.g. float32 in Python); then the memory allocation for the whole matrix can be roughly calculated as

$$(10^4 \times 10^4 \times 4)/10^6 = 400\text{M}.$$

So for a computer with 4G memory, we can fit a 100K × 10K matrix into memory. For a computer with 32G memory, we can fit a matrix of size 100K × 80K (10 × 8 × 400M = 32G).
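The same back-of-the-envelope calculation can be written as a small helper. The function name is hypothetical, and the default of 4 bytes per entry corresponds to a 32-bit float; the estimate covers only the dense matrix itself, not any temporaries created during the updates.

```python
def dense_matrix_bytes(rows, cols, bytes_per_entry=4):
    """Bytes needed to store a dense rows x cols matrix (4 bytes = 32-bit float)."""
    return rows * cols * bytes_per_entry


print(dense_matrix_bytes(10_000, 10_000) / 1e6)    # size in MB of a 10K x 10K matrix
print(dense_matrix_bytes(100_000, 10_000) / 1e9)   # size in GB of a 100K x 10K matrix
print(dense_matrix_bytes(100_000, 80_000) / 1e9)   # size in GB of a 100K x 80K matrix
```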

3.2 Enhanced Matrix Completion 

We looked at MF with different constraints, e.g. non-negativity and orthogonality, and one type of regularization which prevents the entries in the low-rank matrices from being too large. This subsection considers other kinds of regularization when external data sources become available, i.e. data that goes beyond the data matrix X. Usually this is the real-world case, since most social media data contains rich data sources.

In this subsection we consider two types of regularization with corresponding additional data:

1. self-regularization when we have additional linked data between users (in CF, or an additional link type in LP);

2. 2-sided regularization when we have description data of users and items.

We further point to two publications [19] and [8], to demonstrate the above 
two types of regularization, respectively. 

3.2.1 Enhancing Matrix Completion with Self-regularization 

By self-regularization, we refer to the regularization of rows in the low-rank matrix U or V. Assume now we are dealing with an LP problem, in which we would like to predict whether a user trusts another; trust relations are common in review sites like Epinions. Usually there exists another type of link between users, i.e. social relations. Can we use social relations to boost the performance of trust relation prediction? This is exactly the research question proposed in [19].

It turns out the answer is yes: as expected, users with social relations tend to share similar preferences. The basic idea to incorporate this into trust prediction is to add the regularization term Eq. (44) into the general MF framework. In Eq. (44), $\zeta(i,j)$ are the entries of the additional link matrix Z, and D is the diagonal matrix with $D(i,i) = \sum_{j=1}^{m} \zeta(j,i)$, thus $\mathcal{L} = D - Z$ is the Laplacian matrix. It is interesting that, using the trace operator, the regularization Eq. (44) becomes so simple.

Social relations are common in social computing. The similarity between people with social relations has a specific name in social theory, 'homophily', making this type of regularization applicable to many social computing scenarios. If we generalize a bit, we may assume that many linked objects, not necessarily web users, have similarities in terms of the entries of the data matrix X that we would like to predict. For instance, while predicting the sentiment of articles, we may assume that articles authored by the same users tend to express similar sentiment, e.g. political reviewers expressing negative sentiment in their news review articles. We will see that this type of regularization is used in a sentiment analysis paper [10], which we will analyze later.


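$$\begin{aligned}
& \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \zeta(i,j)\,\|U(i,:) - U(j,:)\|_2^2 \\
={} & \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{d=1}^{k} \zeta(i,j)\,(U(i,d) - U(j,d))^2 \\
={} & \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{d=1}^{k} \zeta(i,j)\,\big(U^2(i,d) - 2U(i,d)U(j,d) + U^2(j,d)\big) \\
={} & \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{d=1}^{k} \zeta(i,j)\,U^2(i,d) - \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{d=1}^{k} \zeta(i,j)\,U(i,d)U(j,d) \\
={} & \sum_{d=1}^{k} U^T(:,d)\,(D - Z)\,U(:,d) \\
={} & \mathrm{Tr}(U^T \mathcal{L} U).
\end{aligned} \tag{44}$$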
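The identity can be verified numerically for a symmetric link matrix. In the sketch below the function names are hypothetical and Z is assumed symmetric, which is what allows the two squared terms to be merged in the derivation.

```python
import numpy as np


def pairwise_penalty(U, Z):
    """Half the link-weighted sum of squared distances between rows of U."""
    m = U.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += Z[i, j] * np.sum((U[i] - U[j]) ** 2)
    return 0.5 * total


def laplacian_penalty(U, Z):
    """Trace form Tr(U^T (D - Z) U), with D the diagonal matrix of column sums of Z."""
    D = np.diag(Z.sum(axis=0))
    return np.trace(U.T @ (D - Z) @ U)


rng = np.random.default_rng(7)
Z = rng.random((6, 6))
Z = (Z + Z.T) / 2            # symmetric link strengths
np.fill_diagonal(Z, 0.0)
U = rng.random((6, 3))
print(np.isclose(pairwise_penalty(U, Z), laplacian_penalty(U, Z)))
```

The trace form is the one to use in practice: it costs a few matrix products instead of a double loop, and its gradient with respect to U is $2\mathcal{L}U$ for symmetric Z.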

Regularization and Sparseness More regularization can sometimes overcome the data sparsity problem, to some extent. On the other hand, modelling the error only on observed data entries, as O does in the previous subsection, could also be very effective.

3.2.2 Enhancing Matrix Completion with 2-sided regularization 

Here we consider placing regularization on both U and V together, which we call 2-sided regularization.

Before we start, we review orthogonal non-negative MF a bit. The orthogonality constraint in orthogonal non-negative MF is similar to a 2-sided regularization:

$$\mathrm{Tr}(\Gamma_U(U^TU - I)), \qquad \mathrm{Tr}(\Gamma_V(V^TV - I))$$

are two equality constraints over the low-rank matrices. Such equalities need to be strictly satisfied. Regularization, in contrast, can be viewed as a soft type of constraint: it only needs to be satisfied to some extent, while constraints need to be strictly satisfied. This is the reason why we refer to non-negativity and orthogonality as constraints, while we call homophily a regularization.

Now let us turn our attention back to 2-sided regularization, based on the example from [8], which considers POI recommendation in location-based social networks (LBSN). The first data we have is the check-in data X that encodes the interaction between users and POIs. We are further given some description data A of user interest, and B of POI property, both in the form of word vectors. The question here is, how do we make use of A and B to enhance the matrix completion problem for the interaction matrix X?

Since we are coping with 2-sided regularization, we use 3-factor MF:

$$\min_{U,H,V} \; \mathcal{O} = \|X - UHV^T\|_F^2 - \mathrm{Tr}(\Lambda_U U^T) - \mathrm{Tr}(\Lambda_H H^T) + R\text{'s}. \tag{45}$$

The only question here is how to add the 2-sided regularization terms R's, as we did for the orthogonality constraints.

To utilize A and B, we assume that there are some connections between them, such that they can be used to regularize U and V. In the context of LBSN, we may assume that A and B have similar vocabulary, in which the words have a similar latent space. Therefore we can approximate A and B with 2-factor MF:

$$A \approx UG^T, \qquad B \approx VG^{*T}, \tag{46}$$

with the connection

$$\|G - G^*\|_1 \approx 0. \tag{47}$$

Eq. (47) is important since it really connects U with V, forming a 2-sided regularization. The final objective function now becomes:

$$\begin{aligned} \min_{U,H,V} \; \mathcal{O} = {} & \|X - UHV^T\|_F^2 - \mathrm{Tr}(\Lambda_U U^T) - \mathrm{Tr}(\Lambda_H H^T) \\ & + \lambda_A\|A - UG^T\|_F^2 + \lambda_B\|B - VG^{*T}\|_F^2 + \delta\|G - G^*\|_1 \\ & + \alpha(\|U\|_F^2 + \|V\|_F^2 + \|H\|_F^2 + \|G\|_F^2). \end{aligned} \tag{48}$$

The last line regularizes the factors used in approximating A, B; note that since here we use regularization, instead of constraints as in non-negative orthogonal MF, we can add regularization to U, V, H.

Factorization vs. Regularization We remark here that the idea of co-factoring two matrices (X, A) with shared factors (U) originates from collective matrix factorization [?], which has many applications in CF [16]. An interesting comparative study between collective factorization and self-regularization can be found in [?].

3.3 From Clustering to (Un-/Semi-)supervised Learning

Although different types of extra data sources can be used in enhanced MF, the purpose so far remains matrix completion. This subsection, however, considers other types of machine learning problems, i.e. (un-/semi-)supervised learning. The essential assumption of using MF for (un-/semi-)supervised learning is that the latent rows (columns) are, or can be made, predictive of some dependent variables.

To make use of the predictability, we need mechanisms to connect the latent vectors to responses. The following are the two mechanisms:

1. enforcement: directly enforce the latent space to be the response space;

2. transformation: transform the latent space to the response space. This is similar to what people do in machine learning.

We point to publications [10] and [6] for the demonstration of the above two methods, respectively.

3.3.1 Enforcing Latent Factor to be Response 

In the previous regularizations, we do not force the latent space to be an interpretable space. For instance, in the 2-sided regularization, we do not specify the meaning of U that is used in both the X and A factorizations. However, (un-/semi-)supervised learning requires the latent space to be interpretable. The method, still, is regularization.

[10] deals with the problem of sentiment analysis, for which the authors use 3-factor non-negative orthogonal MF. The input is a post-word matrix X. In addition, we are given emotion indication in some of the posts. "The key idea of modeling post-level emotion indication is to make the sentiment polarity of a post as close as possible to the emotion indication of the post.", formulated as

$$G^{u}\|U - U_0\|_F^2,$$

in which $U \in \mathbb{R}^{m \times 2}$ is the post-sentiment matrix, i.e. $U(i,:) = (1, 0)$ representing that the $i$-th post has a positive sentiment, and $U_0 \in \mathbb{R}^{m \times 2}$ is the post-emotion indication matrix, i.e. $U_0(i,:) = (1, 0)$ meaning the $i$-th post contains positive emotion indication. Similar regularization is applied to V as well.

Such an idea is quite simple; however, it explicitly poses a notable question: is it computationally feasible to strictly enforce U, V to lie in any pre-defined space, i.e. the sentiment space in this case? Based on Proposition 1 in [?], we know that the answer is no. However, as we see in this sentiment analysis work [10], regularization is always possible!

In fact, the enforcement regularization that we see in this work is the most constrained regularization: it is 2-sided regularization for both U, V, and it is enforcement without any transformation coefficients. We will see next how to regularize for supervised learning by transformation.


3.3.2 Transforming Latent Factor to Response 

As we pointed out, the essential idea of supervised learning is to transform the latent variables to some response variable. To see this, we study an example that exploits matrix factorization to boost (sparse) regression.

Here we solve the following optimization problem:

$$\min_{U,V,W} \; \|X - UV^T\|_F^2 + \lambda\|O \odot (UW^T - Y)\|_F^2 + \lambda_X(\|U\|_F^2 + \|V\|_F^2) + \lambda_Y\|W\|_1. \tag{49}$$

Optimizing the objective function accomplishes two goals simultaneously: 1) learning the latent factors; and 2) predicting the dependent variables based on the learnt latent factors. As the learning of U is guided by the prediction of Y (shown later), the learned latent factors can be more predictive in the regression. Note that the parameter $\lambda$ controls the relative importance between matrix factorization and regression: a larger $\lambda$ indicates that the regression should dominate.

O is a mask vector with the first $n_{train}$ (the size of the training set) entries equal to 1, and the other $n_{test}$ (the size of the test set) entries equal to 0. Correspondingly, X contains both the training data in the first $n_{train}$ rows and the test data in the remaining $n_{test}$ rows. Y is also composed of two parts, the first $n_{train}$ entries being the complexity values of the training tasks; the other entries can be any values, as they are not involved in model learning, which is controlled by the 0's in O.

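The Lagrangian function of the objective function is

$$\begin{aligned} L = {} & \|X - UV^T\|_F^2 + \lambda\|O \odot (UW^T - Y)\|_F^2 \\ & + \lambda_X(\|U\|_F^2 + \|V\|_F^2) + \lambda_Y\|W\|_1 - \mathrm{Tr}(\Lambda_U U^T) - \mathrm{Tr}(\Lambda_V V^T). \end{aligned} \tag{50}$$

The derivative with respect to U is:

$$\begin{aligned}
\frac{\partial L}{\partial U} = {} & \frac{\partial\,\mathrm{Tr}(VU^TUV^T - 2X^TUV^T)}{\partial U} \\
& + \frac{\partial\,\lambda\,\mathrm{Tr}\big((O^T \odot WU^T)(O \odot UW^T) - 2(O^T \odot Y^T)(O \odot UW^T)\big)}{\partial U} \\
& + \frac{\partial\,[\lambda_X\mathrm{Tr}(U^TU) - \mathrm{Tr}(\Lambda_U U^T)]}{\partial U} \\
= {} & 2\big(UV^TV - XV + \lambda(O \odot (UW^T))W - \lambda(O \odot Y)W + \lambda_X U\big) - \Lambda_U.
\end{aligned} \tag{51}$$

The derivative with respect to V is:

$$\begin{aligned}
\frac{\partial L}{\partial V} &= \frac{\partial\,[\mathrm{Tr}(VU^TUV^T - 2X^TUV^T) + \lambda_X\mathrm{Tr}(V^TV) - \mathrm{Tr}(\Lambda_V V^T)]}{\partial V} \\
&= 2(VU^TU - X^TU + \lambda_X V) - \Lambda_V.
\end{aligned} \tag{52}$$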


Note that for W the problem becomes a classic Lasso problem; we can update it using a standard algorithm such as LARS.

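According to the KKT conditions:

$$\frac{\partial L}{\partial U} = 0, \qquad \frac{\partial L}{\partial V} = 0, \qquad \Lambda_U \odot U = 0, \qquad \Lambda_V \odot V = 0. \tag{53}$$

We have

$$\begin{aligned} \big(UV^TV - XV + \lambda(O \odot (UW^T))W - \lambda(O \odot Y)W + \lambda_X U\big) \odot U &= 0, \\ (VU^TU - X^TU + \lambda_X V) \odot V &= 0. \end{aligned} \tag{54}$$

It leads to the following updating rules for U, V:

$$\begin{aligned} U(i,j) &\leftarrow U(i,j)\sqrt{\frac{(XV + \lambda(O \odot Y)W)(i,j)}{(UV^TV + \lambda(O \odot (UW^T))W + \lambda_X U)(i,j)}}, \\ V(i,j) &\leftarrow V(i,j)\sqrt{\frac{(X^TU)(i,j)}{(VU^TU + \lambda_X V)(i,j)}}. \end{aligned} \tag{55}$$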
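One step of the U and V updates, with W held fixed, can be sketched as follows. This is an assumed form only: it uses the square-root multiplicative pattern of the earlier sections, treats Y and the mask O as column vectors and W as a row vector so that `U @ W.T` is a column, and requires X, Y and W to be non-negative. The function name and `eps` are illustrative. The Lasso step for W is omitted.

```python
import numpy as np


def update_U_V(X, Y, O, U, V, W, lam, lam_x, eps=1e-12):
    """One multiplicative step for U and V with W fixed (illustrative sketch)."""
    # Regression terms only see training rows, because O is zero on test rows.
    numer_U = X @ V + lam * (O * Y) @ W
    denom_U = U @ V.T @ V + lam * (O * (U @ W.T)) @ W + lam_x * U + eps
    U = U * np.sqrt(numer_U / denom_U)
    V = V * np.sqrt((X.T @ U) / (V @ U.T @ U + lam_x * V + eps))
    return U, V
```

Whatever values Y holds on the masked (test) rows, the update is unchanged, which matches the role of the mask described above.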


3.3.3 A Comprehensive Model 

Here we review an application from [6] that integrates the methods of enforcement and transformation. In this application, we would like to model a user's attitude towards some controversial topic, reflected by his opinion, sentiment and retweeting action. We are given a retweeting matrix X representing users' retweeting action on some tweets, and we would like to predict users' opinion O and sentiment P; the task is to predict these three variables given the user feature F.

We first introduce how the model is built in [6], then discuss other alternatives. To train such a model, the authors propose the following objective


$$\begin{aligned} \min \; \mathcal{O} = {} & \|X - (FW^T)V^T\|_F^2 + \lambda_1\|FW^T - O\|_F^2 + \lambda_2\|(FW^T)S - P\|_F^2 \\ & + \lambda_3\|W\|_1 + \alpha\|W\|_F^2 + \beta\|V\|_F^2 + \gamma\|S\|_F^2 - \mathrm{Tr}(\Lambda_1 W^T) - \mathrm{Tr}(\Lambda_2 V^T), \end{aligned} \tag{56}$$

in which $\lambda_1\|FW^T - O\|_F^2$ and $\lambda_3\|W\|_1$ model opinion from the user features by bringing in the classical linear regression. We can see that modeling the sentiment is also straightforward: $\lambda_2\|FW^TS - P\|_F^2$ simply transforms the user features again with a linear transformation S. The retweeting matrix X, similarly, also uses $FW^T$ as the latent vectors.

To summarize, the model Eq. (56) bases the prediction of retweeting action, opinion and sentiment all on the user features. If we make $\lambda_1$ infinitely large, meaning that we enforce $FW^T = O$, then in fact $X \approx OV^T$ and $OS \approx P$. Such a choice is based on the assumption that opinion drives both the retweeting action and sentiment.

Model Eq. (56) is a comprehensive model, in the sense that the subtasks of matrix completion, clustering and regression are fused together, by basing all predictions on user feature transformation. What if we are not given the user feature information? Instead, we can directly model the relation between retweeting action, opinion and sentiment. A straightforward model could be


$$\begin{aligned} \min \; \mathcal{O} = {} & \|X - UV^T\|_F^2 + \lambda_1\|U - O\|_F^2 + \lambda_2\|US - P\|_F^2 \\ & + \alpha\|U\|_F^2 + \beta\|V\|_F^2 + \gamma\|S\|_F^2 - \mathrm{Tr}(\Lambda_1 U^T) - \mathrm{Tr}(\Lambda_2 V^T). \end{aligned} \tag{57}$$